Abstract

We consider the zero-error capacity of deletion channels. Specifically, we consider the setting 
where we choose a codebook C consisting of strings of n bits, and our model of the channel 
corresponds to an adversary who may delete up to pn of these bits for a constant p. Our goal 
is to decode correctly without error regardless of the actions of the adversary. We consider 
what values of p allow non-zero capacity in this setting. We suggest multiple approaches, one 
of which makes use of the natural connection between this problem and the problem of finding 
the expected length of the longest common subsequence of two random sequences. 

1 Introduction 

There has recently been a great deal of work studying variations of the following channel: n bits are 
sent, but each bit is independently deleted with fixed probability p. This is the binary independently 
and identically distributed (i.i.d.) deletion channel, also called the binary deletion channel or just
the deletion channel. To be clear, a deletion is different from an erasure: if 10101010 was sent,
the receiver would obtain 10011 if the third, sixth, and eighth bits were deleted, but would obtain
10?01?1? if the bits were erased. The capacity of this channel remains unknown. See [10, 12] for
recent overviews.
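To make the distinction concrete, the following minimal Python sketch reproduces the example above. The function names delete_bits and erase_bits are illustrative choices, not notation from the literature, and positions are 1-indexed to match the text.

```python
def delete_bits(sent, positions):
    """Return what the receiver sees when the 1-indexed positions are deleted."""
    return "".join(b for i, b in enumerate(sent, start=1) if i not in positions)


def erase_bits(sent, positions):
    """Return what the receiver sees when the 1-indexed positions are erased."""
    return "".join("?" if i in positions else b for i, b in enumerate(sent, start=1))


sent = "10101010"
lost = {3, 6, 8}
print(delete_bits(sent, lost))  # 10011
print(erase_bits(sent, lost))   # 10?01?1?
```

The erased string still reveals where the losses occurred; the deleted string does not, which is the source of the difficulty.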

In this work, we consider the zero-error capacity for the related setting where up to pn bits can 
be deleted. For brevity, we refer to this variation as the adversarial deletion channel, as we
can imagine the channel as an adversary deleting bits in an attempt to confuse the decoder, and
the goal is to choose a codebook that allows successful decoding against this adversary. Our main
question of interest is to consider what values of p allow non-zero capacity in this setting. It is
immediately clear, for example, that we require p < 0.5, as an adversary that can delete half the
bits can arrange for any sent string to be received as either a string of all 0's or all 1's.
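The p < 0.5 observation can be checked exhaustively for small n. The sketch below assumes an adversary with a budget of n // 2 deletions that simply keeps copies of the majority bit; the name flatten is hypothetical.

```python
from itertools import product


def flatten(sent):
    """Output of an adversary with budget n // 2 that keeps only the majority bit.

    The minority symbol occurs at most n // 2 times, so deleting every copy of
    it (plus surplus majority bits) uses exactly n // 2 deletions.
    """
    n = len(sent)
    zeros = sent.count("0")
    majority = "0" if zeros >= n - zeros else "1"
    return majority * (n - n // 2)


n = 8
outputs = {flatten("".join(bits)) for bits in product("01", repeat=n)}
print(sorted(outputs))  # ['0000', '1111']
```

All 256 strings of length 8 collapse onto two received strings, so at most two codewords can be distinguished and the rate is zero.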

Although this question appears quite natural, it does not seem to have been specifically tackled 
in the literature. It is certainly implicit in the early work on insertion and deletion channels, most 
notably the seminal works by Levenshtein and Ullman [18], although in most early papers the
focus was on a constant number of deletions or insertions, not a constant fraction. The question 
of codes that can handle a constant fraction of worst-case insertions or deletions was specifically 
considered by Schulman and Zuckerman p3], but they did not focus on optimizing the rate. 

We consider two approaches that give provable lower bounds on what values of p allow non-zero 
capacity. Our first bound is derived from a simple combinatorial argument. We provide it as a 
baseline, and because it may be possible to generalize or improve the argument in the future. Our
better bound arises from considering the expected length of the longest common subsequence (LCS) 
of two random sequences, a subject which has come under study previously. (See, for example, the 
work of Lueker [8] and the references therein.) 

Formally, subsequences are defined as follows: given a sequence $X = (x_1, x_2, \ldots, x_n)$, the
sequence $Z = (z_1, z_2, \ldots, z_m)$ is a subsequence of $X$ if there is a strictly increasing set of
indices $i_1, i_2, \ldots, i_m$ such that

\[ z_j = x_{i_j}. \]

Given two sequences X and Y, the longest common subsequence is simply a subsequence of X 
and Y of maximal length. The connection between the adversarial deletion channel and longest 
common subsequences is quite natural. If two strings of length n have an LCS of length at least 
(1 - p)n, then they cannot both be simultaneously in the codebook for an adversarial deletion
channel, as the adversary can delete bits so that the received string is a subsequence common to 
both strings. 
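The confusability condition can be tested directly with the textbook dynamic program for the LCS. This is a sketch under the assumption that both strings have the same length n; the helper names lcs_length and confusable are ours.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, via the standard quadratic DP."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def confusable(x, y, deletions):
    """True if `deletions` deletions per codeword can map x and y to one string."""
    return lcs_length(x, y) >= len(x) - deletions


print(lcs_length("10101010", "01010101"))     # 7
print(confusable("10101010", "01010101", 1))  # True
print(confusable("00001111", "11110000", 3))  # False
```

The two alternating strings share the subsequence 0101010, so a single deletion from each already makes them indistinguishable.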

It is known that the expected length of the LCS of two sequences of length n chosen uniformly 
at random converges to $\gamma n$ for large n and a constant $\gamma$, as we explain more fully below. Lueker
shows $\gamma < 0.8269$ [8]. Our main result is that this implies that for any $p < 1 - \gamma$, we can achieve
a non-zero rate for the adversarial deletion channel. Hence we can prove that for p < 0.1731,
a non-zero rate is possible. In fact we generalize our argument, considering the expected length of the
LCS of two sequences chosen in an alternative, but still random, fashion. While we do not have
provable bounds based on this generalization, experiments suggest that the approach allows further
significant improvements on the bounds, as we explain in Section 4.

2 A Simple Combinatorial Bound: The Bipartite Model 

Our first approach is to work with a bipartite graph that represents both the possible codewords 
and the possible received strings. For notational convenience, let us treat pn as an integer, and let 
us assume without loss of generality that the adversary deletes a full pn bits. (The arguments for 
an adversary that deletes up to pn bits differ only in o(1) terms.) For the rest of this section, our
bipartite graph will have a set S of vertices, with one for each of the $2^n$ possible n-bit codewords, and
a set R of vertices, with one for each of the $2^{(1-p)n}$ possible received strings. An edge connects a
vertex v of S to a vertex w of R if w is a subsequence of v. A valid codebook consists of a subset C of
S with the property that every $w \in R$ is a neighbor of at most one vertex in C. We derive a bound
using this graph in terms of the binary entropy function $H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)$.

Theorem 2.1 The capacity of the adversarial deletion channel is greater than zero whenever
$1 - 2H(p) + p > 0$.

Proof: A standard fact regarding subsequences is that each vertex of R appears as a subsequence
of

\[ \sum_{i=0}^{pn} \binom{n}{i} \]

vertices of S [2]. (The correct value can be seen by considering the vertex of R of all 0s; the proof of
the equality is a simple induction.) Since this sum is $2^{n(H(p)+o(1))}$, on average a vertex of S is adjacent to
$2^{(1-p)n} \sum_{i=0}^{pn} \binom{n}{i} / 2^n = 2^{n(H(p)-p+o(1))}$ vertices of R. In particular we now consider the set S' of
vertices of S that are adjacent to at most twice the average number of vertices of R; by Markov's
inequality, this set is at least half the size of S.

We choose vertices of S' sequentially to obtain our codebook in the following manner. We choose 
an arbitrary vertex from S'. When we choose a vertex of S', it is adjacent to at most $2^{n(H(p)-p+o(1))}$
vertices of R. We remove from S' the chosen vertex, as well as any vertex whose neighborhood
intersects that of the chosen vertex. This removes at most $\sum_{i=0}^{pn} \binom{n}{i} \, 2^{n(H(p)-p+o(1))}$ vertices from S'.
Continuing in this manner, we find a sequence of at least

\[ |S'| \Big/ \left( \sum_{i=0}^{pn} \binom{n}{i} \, 2^{n(H(p)-p+o(1))} \right) \ge 2^{n(1-2H(p)+p-o(1))} \]

codewords. In particular, we can achieve positive rate for p that satisfies $1 - 2H(p) + p > 0$. ■



This yields a positive rate for the adversarial deletion channel for p up to about 0.1334. 
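The numerical threshold can be recovered by solving 1 - 2H(p) + p = 0. The sketch below uses plain bisection; the bracket [0.01, 0.3] and the function names are assumptions of this illustration rather than part of the argument.

```python
from math import log2


def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), for 0 < p < 1."""
    return -p * log2(p) - (1 - p) * log2(1 - p)


def theorem_2_1_threshold(lo=0.01, hi=0.3, iterations=60):
    """Bisection for the root of 1 - 2 H(p) + p on [lo, hi].

    The bracket is an assumption of this sketch: the expression is positive
    at p = 0.01, negative at p = 0.3, and decreasing in between.
    """
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if 1 - 2 * binary_entropy(mid) + mid > 0:
            lo = mid
        else:
            hi = mid
    return lo


print(theorem_2_1_threshold())
```

The printed root should agree with the value of about 0.1334 quoted above.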

Before moving on, we make some remarks. First, Theorem 2.1 can be generalized in a straightforward
manner to larger alphabets. The important step is that for alphabets of size K, each string
of length j appears in

\[ \sum_{i=0}^{n-j} \binom{n}{i} (K-1)^i \]

supersequences of length n.
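For small parameters the supersequence count can be confirmed by brute force. The following sketch enumerates all length-n strings over the alphabet {0, ..., K-1} and compares the count with the closed form $\sum_{i=0}^{n-j} \binom{n}{i}(K-1)^i$; all function names are illustrative.

```python
from itertools import product
from math import comb


def is_subsequence(z, x):
    """Greedy left-to-right test that z is a subsequence of x."""
    remaining = iter(x)
    return all(symbol in remaining for symbol in z)


def count_supersequences(z, n, K):
    """Brute-force count of length-n strings over {0, ..., K-1} containing z."""
    return sum(is_subsequence(z, x) for x in product(range(K), repeat=n))


def closed_form(j, n, K):
    """sum_{i=0}^{n-j} C(n, i) (K-1)^i."""
    return sum(comb(n, i) * (K - 1) ** i for i in range(n - j + 1))


n, j, K = 6, 3, 2
counts = {count_supersequences(z, n, K) for z in product(range(K), repeat=j)}
print(counts, closed_form(j, n, K))  # {42} 42
```

Every string of length 3 has the same number of binary supersequences of length 6, which is the regularity on the R side that the counting argument relies on.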

Second, because of its generality, one would hope that this argument could be improved to 
yield significantly better bounds. As an example, we have considered a modified argument where 
we restrict our initial set S of possible codewords to consist of strings with exactly $\alpha n$ alternating blocks of 0s
and 1s, where $\alpha < 1 - p$. The idea is that a smaller number of blocks, corresponding to a higher
average block length, might yield better results; such arguments have proven useful in the past 
[21 S]. Our modification, however, did not appear to change our derived bound significantly (we 
did not perform a full optimization after calculations showed the bound appeared to remain below 
0.134 with this choice of S). 

Obtaining improved results would appear to require a better understanding of the structure of 
the bipartite graph: in particular, greater insight into the degree distribution and neighborhood 
structure of the vertices of S. We circumvent this barrier by moving to an alternate representation. 



3 A Bound from Longest Common Subsequences 

Our improved bound makes use of known results on longest common subsequences of random 
strings. As stated earlier, if two strings of length n have an LCS of length at least (1 - p)n, then
they cannot both simultaneously be in the codebook for an adversarial deletion channel. This
suggests a natural attack on the problem. Consider the graph where there is a vertex for each n-bit
string. The edges of this graph will connect any pair of vertices that share a common subsequence
of length at least (1 - p)n. Then an independent set in this graph corresponds to a valid codebook
for the adversarial deletion channel, in that no two codewords can be confused by only pn deletions.
We remark that this connection between codes and independent sets is fairly common in the setting
of deletion channels; see, for example, [15].
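For very small n the graph can be built explicitly and an independent set extracted greedily. The sketch below is only an illustration of the codes-as-independent-sets correspondence; it is a simple greedy heuristic, not the existence argument used in the proofs, and the names are hypothetical.

```python
from itertools import combinations, product


def lcs_length(x, y):
    """Length of the longest common subsequence, via the standard quadratic DP."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def greedy_codebook(n, deletions):
    """Greedily grow an independent set in the confusability graph on n-bit strings.

    Two strings are adjacent when their LCS has length at least n - deletions.
    """
    codebook = []
    for bits in product("01", repeat=n):
        word = "".join(bits)
        if all(lcs_length(word, c) < n - deletions for c in codebook):
            codebook.append(word)
    return codebook


code = greedy_codebook(8, 2)
print(len(code), code)
# Sanity check: no two codewords share a subsequence of length n - deletions.
print(all(lcs_length(a, b) < 6 for a, b in combinations(code, 2)))  # True
```

The enumeration is exponential in n, so this is useful only for building intuition about the graph's structure.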

We require some basic facts, which we take from [8]; for an early reference, see also [2]. Let
$L(X, Y)$ denote the length of the LCS of two strings X and Y, and let $L_n$ be the expected length of the
LCS of two strings of length n chosen uniformly at random. By subadditivity one can show that there exists
a constant $\gamma > 0$ such that

\[ \lim_{n \to \infty} \frac{L_n}{n} = \gamma. \]

The exact value of $\gamma$ is unknown; [8] finds what appears to be the best known bounds, of
$0.788071 < \gamma < 0.826820$, and computational experiments from p] suggest $\gamma \approx 0.8128$.

The relationship between 7 and the threshold for zero-error decoding is given by the following 
theorem. 

Theorem 3.1 The capacity of the adversarial deletion channel is greater than zero whenever
$p < 1 - \gamma$.

Proof: For any $\gamma^* > \gamma$, for n sufficiently large we have that $L_n < \gamma^* n$. Further, standard
concentration inequalities show that the length of the LCS of two strings X and Y chosen uniformly
at random is concentrated around its expectation $L_n$. Such results are straightforward to derive;
see, for example, [1]. Formally, let $Z_i$ be the pair of bits in position i of the two strings. Then we
have a standard Doob martingale given by $A_i = E[L(X, Y) \mid Z_1, Z_2, \ldots, Z_i]$. The value of the bits
$Z_i$ can change the value of $A_i$ by at most 2; one can simply remove these bits if they are part of
the LCS, so their value can only change the expected length of the LCS by at most 2. Applying
standard forms of Azuma's inequality (see, for example, [13] [Chapter 12] as a reference), we can
conclude

\[ \Pr[L(X, Y) \ge L_n + \epsilon n] \le 2^{-f(\epsilon) n} \]

for some function f of $\epsilon$ (that can depend on q).

Now, let us focus on the graph described previously, where vertices correspond to strings of 
length n, and edges connect vertices that share an LCS of length at least $\gamma' n$ for some $\gamma' > \gamma$. It
follows from the above that for sufficiently large n the number of edges in the graph is at most

\[ \binom{2^n}{2} 2^{-cn} \]

for some constant c > 0. This is because the probability that a randomly chosen pair of vertices forms an
edge of the graph is at most $2^{-f(\epsilon) n}$ for an appropriate $\epsilon$. (The probabilistic statement derived from Azuma's inequality
would include edges that correspond to self-loops, but our graph does not include self-loop edges;
since we seek an upper bound on the number of edges, this is not a problem.) 

We can now apply a standard result on independent sets, namely Turan's theorem, which 
provides that in a graph with j vertices and k edges, there is an independent set of size at least
$j^2/(2k + j)$ [17]. In this context, Turan's theorem yields the existence of an independent set, or
codebook, of size at least

\[ \frac{2^{2n}}{2 \binom{2^n}{2} 2^{-cn} + 2^n} \ge 2^{cn-1}, \]

where the inequality holds once c is taken to be at most 1. That is, we have a code of positive rate
whenever the fraction of deleted bits p is at most $1 - \gamma'$,
giving a non-zero rate for any $p < 1 - \gamma$. ■



This yields a provable positive rate for the adversarial deletion channel for p up to about 0.1731, 
and an implied bound (assuming $\gamma \approx 0.8128$) of 0.1872.
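The arithmetic behind the final step of the proof of Theorem 3.1 can be spelled out as follows. This is a sketch under two assumptions: Turan's bound is used in the form $j^2/(2k+j)$ for a graph with j vertices and k edges, and the constant c satisfies $0 < c \le 1$, which is harmless because shrinking c only weakens the bound on the number of edges.

```latex
\[
  j = 2^n, \qquad k \le \binom{2^n}{2} 2^{-cn} \le 2^{(2-c)n - 1},
\]
\[
  \frac{j^2}{2k + j} \;\ge\; \frac{2^{2n}}{2^{(2-c)n} + 2^n}
  \;\ge\; \frac{2^{2n}}{2 \cdot 2^{(2-c)n}} \;=\; 2^{cn-1}
  \qquad (0 < c \le 1),
\]
\[
  \text{rate} \;\ge\; \frac{\log_2 2^{cn-1}}{n} \;=\; c - \frac{1}{n}
  \;\longrightarrow\; c > 0 .
\]
```

The second inequality uses $2^n \le 2^{(2-c)n}$, which is exactly where $c \le 1$ is needed.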

We note that while Theorem 2.1 uses a bipartite graph, it should be clear that the process 
described greedily finds an independent set on the corresponding graph used in Theorem 3.1, where
edges on the $2^n$ possible codewords correspond to paths of length 2, or to shared subsequences,
on the bipartite graph. Theorem 2.1 results in weaker bounds, which is unsurprising, given the
sophisticated technical machinery required to obtain the bounds on the expected LCS of random
sequences.

Theorem 3.1 can also be generalized to larger alphabets, as the techniques of [8] provide bounds
on the corresponding constant for the expected LCS of strings chosen uniformly at random from
K-ary alphabets.

It might be tempting to hope that $1 - \gamma$ is a tight threshold. Consider the graph obtained
when p is slightly larger than $1 - \gamma$; this graph will be almost complete, as almost all edges will
exist, and therefore it might seem that there might be no independent set of exponential size to
find. However, it might be possible to obtain a large subgraph that is less dense, which would in
turn yield a better threshold. For example, the string consisting of all 0s has very small degree
(comparatively) in the graph; by finding a suitable set of "small degree" vertices, one might hope 
to increase the threshold for non-zero capacity. 

In the next section, we use this insight to generalize our argument above in order to obtain, at 
least empirically, improved bounds. 



4 Using First-Order Markov Chains 



To generalize the argument of Theorem 3.1, we consider random strings generated by symmetric 
first-order Markov chains. That is, the initial bit in the string is 0 or 1, each with probability 1/2;
thereafter, subsequent bits are generated independently so that each matches the previous bit with
probability q, and changes from b to 1 - b with probability 1 - q. When q = 1/2, we have the
standard model of randomly generated strings. Below we consider the case where q > 1/2. Again,
first-order Markov chains have proven useful in the past in the study of deletions channels ^ |3] , 
so their use here is not surprising. 
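The generation process is short enough to state in code. The following sketch assumes Python's standard pseudo-random generator as the source of randomness; markov_string and count_changes are illustrative names.

```python
import random


def markov_string(n, q, rng):
    """n bits: the first is uniform, each later bit repeats the previous with probability q."""
    bits = [rng.randrange(2)]
    for _ in range(n - 1):
        stay = rng.random() < q
        bits.append(bits[-1] if stay else 1 - bits[-1])
    return "".join(map(str, bits))


def count_changes(s):
    """Number of positions where the string switches from 0 to 1 or from 1 to 0."""
    return sum(a != b for a, b in zip(s, s[1:]))


rng = random.Random(0)
s = markov_string(10000, 0.95, rng)
print(len(s), count_changes(s))  # roughly (1 - q)(n - 1), i.e. about 500 changes
```

Larger q produces fewer changes and hence longer runs of identical bits, which is the regime of interest below.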

Subadditivity again gives us that under this model, the expected length of the longest common subsequence
of two randomly generated strings converges to $\gamma_q n$ for some constant $\gamma_q$. We claim the
following theorem.

Theorem 4.1 For any constant q, the capacity of the adversarial deletion channel is greater than
zero whenever $p < 1 - \gamma_q$.



Proof: Following the proof of Theorem 3.1, let $L_n$ be the expected length of the LCS of two strings
X and Y of length n chosen according to the above process.

For any $\gamma^* > \gamma_q$, for n sufficiently large we have that $L_n < \gamma^* n$. As before, we can make use
of concentration, although the use of Azuma's inequality is a bit more subtle here. Again, let $Z_i$
be the pair of bits in position i of the two strings, so that we have a standard Doob martingale
given by $A_i = E[L(X, Y) \mid Z_1, Z_2, \ldots, Z_i]$. The values of the bits $Z_i$ are important, in that the
expected value of the LCS conditioned on these bits depends non-trivially on whether the bits are
the same or different. However, we note that if the bits do not match, then after an expected
constant number of bits (with the constant depending on q), the bits will match again, and hence
the value of the bits $Z_i$ still can change the value of $A_i$ by at most a constant amount. So, as before,
we can conclude

\[ \Pr[L(X, Y) \ge L_n + \epsilon n] \le 2^{-f(\epsilon) n} \]

for some function f of $\epsilon$.

We now use a probabilistic variation of Turan's theorem to show the existence of suitably large 
independent sets in the graph where vertices correspond to strings of length n, and edges connect 
vertices that share an LCS of length at least $\gamma' n$ for some $\gamma' > \gamma_q$. Our proof follows a standard
argument based on the probabilistic method (see, e.g., [13] [Theorem 6.5]). For each sequence s of
length n, let A(s) be the number of times the sequence changes from 0 to 1 or 1 to 0 when read
left to right. Then the probability of s being generated by our first-order Markov chain process is

\[ p(s) = \frac{1}{2} \, q^{n-1-A(s)} (1-q)^{A(s)}. \]

Consider the following randomized algorithm for generating an independent set: delete each vertex 
in the graph (and its incident edges) independently with probability $1 - 2^n z p(s)$ for a z to be
determined later, and then for each remaining edge, remove it and one of its adjacent vertices.
If Q is the number of vertices that survive the first step, we have

\[ E[Q] = \sum_{s \in \{0,1\}^n} 2^n z p(s) = z 2^n. \]

Now let R be the number of edges that survive the first step. Then 

\[ E[R] \le \sum_{s,t \in \{0,1\}^n} z^2 2^{2n} p(s) p(t) \, \mathbf{1}(L(s,t) \ge \gamma' n). \]

Notice that E[R] is bounded by $z^2 2^{2n} \cdot \Pr[L(X, Y) \ge \gamma' n]$ for a random X and Y chosen according
to the first-order Markov process. As argued previously, we therefore have $E[R] \le z^2 2^{(2-c)n}$
for some constant c. Choosing $z = 2^{(c-1)n-1}$ gives an expected independent set of size at least
$E[Q - R] \ge 2^{cn-1} - 2^{cn-2} = 2^{cn-2}$, showing that there must exist an independent set which gives a non-zero rate
for any $p < 1 - \gamma_q$.

It remains to check that in fact our choice of z ensures that $1 - 2^n z p(s)$ is indeed a valid
probability; that is, we need $2^n z p(s) \le 1$ for all s. We have $2^n z p(s) = 2^{cn-1} p(s)$; in our argument
we can in fact choose c to be a sufficiently small constant (specifically, choose $c < \log_2(1/q)$) to
ensure that this is a valid probability. ■
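The sampling step of the proof can be sanity-checked numerically for small n. The sketch below assumes the string probability implied by the process definition (1/2 for the first bit, a factor q for each repeated bit, a factor 1 - q for each change) and the choice $z = 2^{(c-1)n-1}$; the parameter values n = 10, q = 0.8, c = 0.3 are illustrative only.

```python
from itertools import product


def markov_prob(s, q):
    """Probability of s: 1/2 for the first bit, q per repeat, 1 - q per change."""
    changes = sum(a != b for a, b in zip(s, s[1:]))
    return 0.5 * q ** (len(s) - 1 - changes) * (1 - q) ** changes


n, q, c = 10, 0.8, 0.3            # c < log2(1/q), about 0.32; illustrative values
z = 2 ** ((c - 1) * n - 1)
strings = ["".join(bits) for bits in product("01", repeat=n)]
keep = [2 ** n * z * markov_prob(s, q) for s in strings]  # survival probabilities

print(sum(markov_prob(s, q) for s in strings))  # close to 1
print(max(keep) <= 1)                           # True
print(sum(keep), z * 2 ** n)                    # both close to 2^(c n - 1) = 4
```

The largest survival probability belongs to the two constant strings, which is why the constraint on c is governed by q.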

While we do not have a formal, provable bound on $\gamma_q$ for q > 1/2, such bounds for specific
q might be obtained using the methods of [8]. We leave this for future work. Besides requiring
a generalization of these methods to strings chosen non-uniformly, we note that another difficulty is that
these techniques require large-scale computations where the tightness of the bound that can be 
feasibly obtained may depend on q. 

Empirically, however, we can simply run our first-order Markov chains to generate large random 
strings and compute the LCS to estimate the behavior as $q \to 1$. (For more detailed estimation
methods for LCS problems, see for example [9].) Our experiments suggest that the length of the LCS of two
sequences randomly generated in this manner may converge to 0.75n or something near that
quantity. For example, we found the length of the LCS for 1000 pairs of sequences of length 100000
with q = 0.95; the average length was 77899.4, with a minimum of 77499 and a maximum of 78375. 
For 1000 pairs of length 100000 with q = 0.99, the average length of the LCS was 77479.8, with 
a minimum of 76083 and a maximum of 78831. For 1000 pairs of length 100000 with q = 0.999, 
the average length of the LCS was 75573.2, with a minimum of 68684 and a maximum of 81483. 
Naturally the variance is higher as runs of the same bit increase in length; however, the trend seems 
readily apparent. 
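An experiment of this kind can be sketched in a few lines. The version below is a scaled-down illustration using the quadratic-time LCS recurrence, so it is practical only for n in the hundreds or low thousands and will not reproduce the figures reported above; the function names, seed, and parameter choices are assumptions of the sketch.

```python
import random


def markov_string(n, q, rng):
    """n bits: the first is uniform, each later bit repeats the previous with probability q."""
    bits = [rng.randrange(2)]
    for _ in range(n - 1):
        bits.append(bits[-1] if rng.random() < q else 1 - bits[-1])
    return bits


def lcs_length(x, y):
    """Length of the longest common subsequence, via the standard quadratic DP."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_ratio(n, q, trials, seed=0):
    """Average LCS length divided by n over `trials` independent pairs."""
    rng = random.Random(seed)
    total = sum(
        lcs_length(markov_string(n, q, rng), markov_string(n, q, rng))
        for _ in range(trials)
    )
    return total / (trials * n)


for q in (0.5, 0.95):
    print(q, estimate_ratio(300, q, trials=5))
```

For short strings the ratio sits below its limiting value, so estimates at this scale indicate the trend rather than the constant itself.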



It is also worth noting that Theorem 4.1 can generalize beyond choosing vertices according
to a distribution given by first-order Markov chains. The key aspects (subadditivity providing a
limiting constant, concentration using Azuma's inequality, and some reasonable variation of Turan's
theorem based on the probabilistic method) might naturally be applied to other distributions over
n-bit strings as well. Indeed, we emphasize that even if it can be shown that $\gamma_q$ approaches 0.75 as
q approaches 1, it may still be that this does not bound the full range of values of p for which the
adversarial deletion channel has nonzero capacity.

5 Conclusion 

We have given a proof that for any $p < 1 - \gamma_q$ the rate of the adversarial deletion channel is positive,
where $\gamma_q$ is the value such that two strings of length n chosen according to a symmetric first-order
Markov chain that flips bits with probability 1 - q have, in the limit as n goes to infinity, an LCS of
expected length $\gamma_q n$. For q = 1/2, bounds on $\gamma_q$ are known.

Many natural open questions arise from our work. Can we find tighter bounds for $\gamma$? Can
we derive a stronger bound using $\gamma_q$? Are there better approaches for choosing codebooks in this
setting? What more can we say about the structure of the graphs underlying our arguments? All 
of these questions relate to our main question here: can we improve the bounds on p for which the 
adversarial deletion channel has non-zero rate? 

Our theorems also implicitly give bounds on the rate (or equivalently the size of the codebook) 
obtainable away from the threshold. We have not attempted to optimize these bounds, as we expect 
our arguments are far from tight in this respect. For example, we note that in Theorem 3.1 the
bound on the size of the independent set obtained can seemingly be strengthened; specifically, the
probabilistic method gives a lower bound on the size of the independent set of a graph with vertex
set V and vertex degrees d(u) for $u \in V$ of $\sum_{u \in V} \frac{1}{d(u)+1}$. Such bounds seem difficult to use in
this context, and would not appear to change the threshold obtained with this proof method
without additional insights. Finding good bounds on the rate for this channel, and of course finding 
good explicit codes, remain open questions. 

Finally, we mention that for a large class of adversarial channels known as causal adversary or 
memoryless channels, it is known that no positive rate is achievable for p > 0.25 0|6]. Proving an 
equivalent upper bound for the adversarial deletion channel is a tantalizing goal, especially in light 
of the result from Section 4 suggesting that for p close to 0.25, positive rate is achievable.

Unfortunately, it is not clear how to extend the techniques of [U [6] to the adversarial deletion 
channel. At a high level, the results of O |6] rely on the fact that, given two codewords x and x' 
with Hamming distance d, an adversary can cause a "collision" between x and x' by flipping d/2 
bits of each. So as long as the adversary can find two codewords of Hamming distance at most n/2, 
just n/4 bit flips are needed to cause a collision. 

In contrast, it is possible for two n-bit strings x and x' of Hamming distance d to have no 
common subsequence of length n - d + 1. That is, it might require d deletions to both x and x'
to cause a collision, while it would only require d/2 bit-flips. This appears to be the fundamental 
difficulty in proving that for any specific p < 0.5, no positive rate is achievable for the adversarial 
deletion channel. Proving such an upper bound is perhaps our biggest open question. 
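The contrast between bit flips and deletions can be seen on a small instance. The pair below (the all-zeros string and a string beginning with d ones) is our own illustrative choice: it has Hamming distance d, collides after d/2 flips to each string, yet has no common subsequence longer than n - d.

```python
def hamming(x, y):
    """Number of positions in which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


def lcs_length(x, y):
    """Length of the longest common subsequence, via the standard quadratic DP."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def flip(s, positions):
    """Flip the bits of s at the given 0-indexed positions."""
    return "".join(str(1 - int(b)) if i in positions else b for i, b in enumerate(s))


n, d = 12, 4
x = "0" * n
y = "1" * d + "0" * (n - d)

print(hamming(x, y))                       # 4
print(lcs_length(x, y))                    # 8, i.e. n - d
print(flip(x, {0, 1}) == flip(y, {2, 3}))  # True: d/2 flips to each string collide
```

Here a flipping adversary needs only 2 changes per codeword, while a deleting adversary needs 4 deletions per codeword to reach a common received string.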